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1 Abstract 

The aim of this technical report is to give a short overview of known techniques for network 
tomography (introduced in the paper of Vardi (1996)), extended by a Bayesian approach 
originating Tebaldi and West (1998). Since the studies of A.K. Erlang (1878-1929) on 
telephone networks in the last millennium, lots of needs are seen in todays applications 
of networks and network tomography, so for instance networks are a critical component 
of the information structure supporting finance, commerce and even civil and national 
defence. An attack on a network can be performed as an intrusion in the network or as 
sending a lot of fault information and disturbing the network flow. Such attacks can be 
detected by modelling the traffic flows in a network, by counting the source destination 
packets and even by measuring counts over time and by drawing a comparison with this 
'time series' for instance. 



2 Introduction 

In order to know, find and understand the typical denial of service attack, it is necessary 
to understand the principle of protocols transmitted over a network. As an example we 
can look at the TCP protocol. The TCP-header is illustrated in figure [H 

One of the main features of TCP is the concept of ports. Each session to or from an 
application is assigned a source port and a destination port. A source destination port 
pairing is used to disambiguate multiple ongoing sessions between machines. TCP also 
implements a two way connection scheme based on the usage of flags in the header. A 
common interaction on a network will be, that the client sends a packet with a synchronize 
flag set indication a communication. The destination then opens a port for the commu- 
nication based on that request. The server is responding with both a synchronize and 
an acknowledgement flag and then finally the client responds with an acknowledgement 
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Figure 1: The TCP segment 



flag. This is the three way handshake (see figure [2]), which sets up the connection and 
allows a two way communication (description in a very simply way). One idea of an 
attack is to flood a computer with bogus requests or to cause it to devote resources to the 
attack at the expense of the legitimate user of the system. The attacker is just sending 
packets that request for a communication but never completes the three way handshake. 
Another attack is to send packets via the network that are full of errors so that the victim 
computer is forced to spend time with these errors. This results in a number of reset (or 
other) packets with no obvious session. 

We are able to compute detection probabilities in the following way (see Marchette (2005) 
and Moore et al. (2001)). IP addresses are unique 32-bit numeric address of every host 
and router on the Internet. '"Spoofing"' denotes the changing of the source address to 
a nonexistent address. We simply collect the packets with no obvious session. Assume 
the spoofed IP addresses are generated randomly, uniformly and independently on all 2 32 
addresses. We assume that d packets are sent in an attack on a victim in a network. If 
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Figure 2: Three- Way-Handshake 
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we monitor all packets to IP addresses, the probability of detecting an attack is given by 

d 



P('detect an attack') = 1 - (l - — J 



with expected number of disturbing packets is given by 

wd 

where w denotes the number of monitored IP addresses. To infer how many packets were 
originally sent we need to estimate the severity of an attack. Under the assumption of 
independence probability of defining j packets as attacking packets is given by 

d\ / w \3 / w \ d ~j 



P U 'packets') = g) (£)' (l - £) 



and the maximum likelihood estimate for d is given by 

J2 32 



d = 



w 



So if we see j packets, we can estimate the number of such attacks. From the literature 
it is known that the assumption of independence the number of attack packets between 
two detected packets, is given by 

s=l 

where N is the number of IP addresses used for randomly simulating those IP addresses 
that are used by an attacker, w is the number of monitored IP addresses. 
Another more sophisticated approach will be in monitoring and modelling the behaviour 
of the packets in the flow through the network. Looking at the traffic we want to estimate 
the network flow intensity. The aim will be to estimate the traffic intensities by two ways. 
First we are able to measure source destination (directed) pairs of nodes and then perform 
repeated measurements on the nodes to count packets (for example phone calls in routing, 
emails and so on) transmitted over a communication network. One main assumption in 
our mathematical model is, that we deal with a strongly connected network, which means, 
that there always exists a directed path between any two nodes. 

When we study the architecture of networks, we distinguish between two main groups 
of networks - those that are deterministic (fixed routing) networks and those that are 
random (Markovian routing) networks. In the first group we deal with directed paths 
between the nodes, that are fixed and known for each communication. For the second 
group the travelling of information (sending of packets) is determined by a fixed known 
Markov chain. It can be easily seen that random routing is a special case of fixed routing. 
A source destination pair (short SD) transports information from the source to the desti- 
nation over a direct connected path in the network. We introduce c as the number of SD 
pairs, which can be calculated from the number of nodes n by 



c = (n — \)n. 
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Figure 3: Example of a directed network with four nodes 
The number of transmitted information of a SD pair j at measurement period k is given by 

(k) 

X- , which, like in classical "teletraffic theory" is assumed to follow a Poisson distribution 
with parameter Xj, i.e. 

Xf ~ Po(Xj). 

We can formulate the SD transmission vector at period k by X^' = (X[ , X^) 1 . 
For the modelling of the problem we need to introduce the r x c routing matrix A for a 
deterministic network SIS db (0, l)-matrix given by 

A = (ay). 

We get ciij = 1 if the link i belongs to the directed path of the SD pair and ay = if the 
link i does not belong to the directed path of the SD pair. 



Vardi (1996) gives several example networks, for instance such network is given in figure 
[3l It is a four-node directed network and consists of 

c=(n-l)-n = 3-4=12 

SD pairs and seven directed links. 

For better reading the routing matrix A is given for better reading in the table [2] with Y 
and Xj describing the structure displayed in table [2J 
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Table 1: Routing matrix A 



The measured data on all links of the network is given by Y( fc) = (Y]_ , Yr), where 
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Table 2: Structure represented in A 

r denotes all directed links in the network with the property r = 0{n) and Or. The 
formulation of the network model is given by 

Y = AX (1) 

and if we consider measurement periods k, we rewrite this as 

y( fc ) — AX^. 

The goal is to estimate A = (Ai, A c ) from Y^, Y^. The following questions turn 
up: 

— Are the parameters identifiable? 

— Are the estimates consistent? 

The model we deal with is a linear one, but we cannot use a linear regression nor a 
random effect model because we deal with a (0, l)-matrix A, nonnegativity constraints 
on the parameters and the Poisson assumption for the number of transmitted messages. 
The identifiability of the parameter vector A can be easily verified by the following lemma 
(see Vardi (1996)). 



Lemma If the columns of the routing matrix A are all distinct and each column has 
at least one non-zero entry a^-, then A is identifiable. 

The proof follows the principle of induction and we refer to Vardi (1996). If we find 
a zero column in the routing matrix A, we can conclude that the corresponding SD pair 
is not connected by a path, and if there is a zero row, then the corresponding link is not 
a part of the network. This observations leads us to the following assertion: 

Lemma If c > 2 r — 1, then some rates Ai, A c cannot be estimated separately. 



3 Parameter Estimation 

With this model setting we can apply classical maximum likelihood estimation (MLE), 
also iterative expectation maximisation (EM) algorithms are also proposed in the litera- 
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ture. Using maximum likelihood estimation we can expect problems due to the nonlinear 
constraints. The structure of the log-likelihood function, which is to maximised, is hard 
to evaluate. The likelihood equations read 

-?L = for % = 1, ...,c, 
oXi 

where / denotes the logarithm of the likelihood function L. In vector notation this can be 
expressed as 



1 K 

- E X [X^\Y^ = AX«] - A = 



k=i 

X( fe ) are the complete (unobserved) data and are the incomplete (observed) data. A 
formulation of the EM algorithm under the assumption of independence of the k compo- 
nents is given by 

A(n+1) = ^E^[ x(fc) i Y(fc) ' A(n) ]- 
k=i 

A problem which we mark out here is, that the above given summands are hard to 
calculate, since the solutions are located in the integer range. For finding a maximum it 
is necessary, that the log-likelihood is concave. By evaluating the Hessian matrix H given 

by 

H ' &H ' 



d\d\j 



we see that this matrix is not necessarily negative semidefinite, so / is not necessarily 
concave (see Vardi (1996)). A resolution of this problem is given by the following propo- 
sition 

Proposition 1 If A* is an interior point, then for large K, I is concave in the neigh- 
bourhood of A*. 

Other estimation methods instead of MLE are based on normal approximations. Under 
the assumption of normality: 

X ~ N(X, A), 

where A = diag(A) is a c x c matrix, the joint distribution of (X*, Y*)* assuming Y = AX 
to hold is given by 

x )-n (( x H A AA * 

Y J c+r \\ AA J ' V AA AAA* 
The conditional distribution of X given Y is given by 

X|Y ~ N C (X + AA*(AAA)- X (Y - AA), A - A*(AAA*) _1 AA) 
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so we are able to approximate 

E[X\Y, A] « A + AA^AAA^-^Y - AA) 
and get the following iteration formulae 

K 

h 



1 K 

A (n+i) = _ Yy\ {n) + AWA^AAWA'j-^YW - AA (n) )]. 



k=i 



At this point we should mention, that a priori all Aj >> 0. Here we have to expect 
nonnegligible approximation errors. Because of the matrix inversion some of the 
can be negative. Another approach that has been proposed in the literature is to assume 
the sum of the Y( fc ) to be normally distributed, 



1 K 

Y = — Y(k) ~ A r (AA, A"" 1 AAA*). 



k=i 

The log-likelihood of Y is given by 

/(A) = - log | AAA* | - K(Y - AA)*(AAA*)- 1 (Y - AA) -> max . 

Aj>0 

Another approach to the estimation problem is based on sample moments where Y is 
completely determined by the mean vector AA and the covariance matrix AAA* of Y 
and under the usage of the first and second moment we get 

• E(Y) = Y = AA 

• Cov(Y l} Y h ) = ±E k Y^Y™ - %Y h = E/=i OiiawA, for 1 < % < h < r 

where the first moment equation is independent of the Poisson assumption and the second 
moment equation strongly depends on the Poisson model. 



4 Prior Models for Network Tomography 

In this section we will investigate the problem of computing and summarizing the joint 
posterior distribution of p(X|Y) for all observed messages of SD pairs given the observed 
link counts Y. For the posterior distribution we need a model for the prior distribution 
p(X) to be tied together with Y = AX. Under the assumption, that the Xj are inde- 
pendently Poisson distributed over the routs j, the prior specification is completed by a 
prior of A. The joint distribution of the model is then given by 

p(X,A)=p(A)f[Af' e ^. 

i=i 

We are interested in estimation of X since we can infer X and A from the joint distri- 
bution. On a more advanced modelling standard we can also use hierarchical modeling 
for the parameters Aj. It is common in literature to model such hyperparameters by a 
normal distribution N(/j,, a). Posterior computations are difficult to evaluate analytically. 
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For example, it is unrealistic to evaluate them for large networks. For this purpose we 
introduce some iterative MCMC simulation algorithms. As an example consider Gibbs 
sampling, which iteratively resamples from the conditional posterior for elements of the 
X and A variables. Under the usage of 

c 

p(A|X,Y)=p(A|X) = IJp(A i |X i ), 

i=i 

whose components have the form of the prior density p{\j) multiplied by a gamma 
function arising in the Poisson based likelihood function. It is possible to simulate new 
A values as a set of independent drawing from the univariate posterior density. If the 
prior densities are gamma densities or a mixture of gamma densities, these drawings 
are trivially made from the corresponding gamma or mixed gamma posterior densities. 
Otherwise, as proposed in literature, we have to use the rejection method or embedded 
Metropolis Hasting steps in the MCMC scheme in the standard Metropolis - Gibbs frame- 
work. 

The theoretical structure of a general network model leads to the following theoretical 
result for computing samples from the conditional posteriors. Let A be fixed and focus 
on the conditional posterior p(X|A, Y), then we can use the following theorem due to 
Tebaldi and West (1998). 

Proposition 2 In the network model Y = AX and under the assumption that A has full 
rank r, we can reorder the columns of A such, that the revised routing matrix has the 
form 

A=[A!,A 2 ] 

where A x is a nonsingular r x r matrix. By similar reordering the elements of the vector 
X and partitioning as X* = (X^, X 2 ), it follows that 

X 1 = Ar 1 (Y-A 2 X 2 ). 

This result easily follows from the fact that Y = AX = A1X1 + A 2 X 2 . 

The full rank assumption is satisfied by all networks in real world. Otherwise there 
is a redundancy in specification, and one or more rows of A can be deleted to get linear 
independent rows. The result in the last proposition implies that, given Y and the 
assumed values of the (c — r) route counts in X 2 , we are able to compute directly the 
remaining r route flows simply based on the algebraic structure of the routing matrix. 
For the reordering of the matrix A we can use the QR decomposition of arbitrary full 
rank matrices. After the QR decomposition we get an r x r orthogonal matrix Q and 
an r x c upper triangular matrix R, the first r columns of which correspond to r linear 
independent columns of A. These are identified by a permutation of column indices. 
With this knowledge we can deduce, that the conditional distribution p(X|A,Y) lies 
in the c — r dimensional subspace defined by the partition A = [A 1; A 2 ]. After the 
partitioning of the routing matrix A the posterior has the form 



p(X 1 |X a ,A,Y)p(X a |A,Y), 
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where p(Xi|X 2 , A, Y) is degenerated at Xi = A 1 1 (Y - A 2 X 2 ) with X 2 = (X r+1 , .., X c )' 
and Xi = (Xi, ...,X r ) 1 defined as above. The conditional is given by 

c , x 

p(X 2 |A,Y)ocn^T 

a=l 

with the support defined by X a > for all a — 1, . . . , c. It is the product of independent 
Poisson priors for the Xi constrained by the model and the reordering. The utility of this 
expression is in delivering the set of complete conditional posteriors for elements of the 
X 2 vector to form a part of the iterative simulation approach to posterior analysis. Let 's 
now consider each element Xi of X 2 , {i — r + 1, . . . , c), and write X 2 _i for the remaining 
elements. The conditional distribution p(Xj|X 2; _i, A, Y) is given by 

p(X i |X a ,_ 1 ,A,Y)<x^JI^ > 

a=L 

over the support of the expression above. The linear constraints on G {r + 1, . . . , c}, 
are of the form Xi > di and Xi < e*, where the values di and are functions of the 
conditioning values of X 2) _ ; and Y. Together with Xi > we obtain at most a set of 
r + 1 constraints on Xi. It is computationally very burdensome to evaluate directly these 
constraints and identify their intersection. So we can make direct simulations. 
For the simulation of the full posterior p(X, A|Y) we need now fixed starting values of 
the route counts X. We can apply the following algorithm according to Tebaldi and West 
(1998): 

Algorithm 

1. Draw sampled values of the rates A from c conditionally independent posteriors 
p(X a \X a ), 

2. conditioning on these values of A simulate a new X vector by sequencing through 
i — r + 1, . . . , c, and at each step sample a new Xi with the conditioning elements 
from the X 2j _i set at their most recent sampled values, 

3. iterate. 

This is a known standard Gibbs sampling setup. Scalar elements of both A and X are 
resampled from the relevant distribution conditional on most recently simulated values 
of all other uncertain quantities. In step 2 we require evaluation of the support which is 
best done by a simulation method such as embedded Metropolis-Hastings steps. We note 
that from Y = AX it is possible to identify bounds on each A suitable range for the 
proposal distribution can be computed from that. 

5 Usage of Bayes Factors for Modelling of Network 
Traffic 

If there is a sequence of packets transmitted over a network, we can evaluate a statistical 
profile of that sequence based on the information of the header and compare this to similar 
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sequences in the past. This historical behaviour can be saved in a stochastic matrix where 
each element of the matrix is given by 

Pjku = P('SD' = fc|'SD before' = j, 'IP of sender' = u) 

Since we know the header in the packet, we are able to model the behaviour of the sender 
over time. We can base an analysis on these matrices. DuMouchel (1999) has made a 
similar approach for modelling of the behaviour of commands in a shell. Since we use some 
kind of categorical data, we are able to use the multinomial distribution as proposed in 
the literature. For Bayesian inference we can use the Dirichlet (prior) distribution, which 
is the natural conjugate distribution to the multinomial distribution. 
Let p = (pi, ...,Pk) be a random vector which is Dirichlet distributed with the density 

JKP) T(a k ) 

with at > for all i. The multinomial probability for a count data vector n = (ni, uk) 
with n = J2 k n k is given by 

P(n\p) = nlTT 

k nk] 



From above formulas we get the marginal distribution 

p, x = w! -i-r a k {a k + 1) • ■ ■ ■ • (at + n k - 1) 

[Tl) ~ a(a + 1) • . . . • (& + n - 1) \1 n k \ 

with a = J2 k a k . The posterior distribution is given by the following Dirichlet distribution 



n k +a k -l 
Pk 



P WB )= (a+ft -i) l n^_ 5f 



On the idea that one user u in the network generates a sequence of T + 1 packets 
Co,Ci,...,Ct we can build the following hypotheses for a test of sending packets that 
disturb the network 

H : P(C t = k\C t -i = j) = p jku 
H 1 :P{C t = k\C t . 1 =j)=Q k 

where 



(Qi, Qk) ~ Dirichlet (a i, a ok ). 

We make the assumption that the null hypothesis H says that a legitimate user is gener- 
ating packets out of the profiles of the transition probabilities. The alternative hypothesis 
Hi says, that T packets are sent through the network, are drawn randomly and indepen- 
dently from a probability vector following a Dirichlet distribution with given hyperparam- 
eters. These hyperparameters have to be estimated. Hi is more general than H since we 
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do not know Q in comparison with the fully specified P u = (j)jku)- H is not nested in 
Hi. For checking the practicability we suggest the usage of Bayes factors BF given by 



BF = 



P(C , • 


.., Ct 


m 


P(C ,. 


.., Ct 


\Ho) 



for inference. For large BF we will prefer the alternative hypotheses. Instead of BF often 

x = log(BF) 

is used, which is called the "weight of evidence" . We can see that modelling the behaviour 
using the network with the network tomography and combining it with the concepts of 
Bayes factors we are able to implement a large apparatus for monitoring networks and to 
draw a conclusion whether there is an attack on our monitored network. 

6 Conclusions 

In this paper we gave an overview of the current literature in network modelling and 
network tomography. We extended that field by developing a method for monitoring 
networks with Bayes factors for testing hypothesis testing whether there is an intruder 
in the network, who performs several forms of attacks. This method can be implemented 
with the usage of control charts which are common in quality control for monitoring 
networks. Further work will be in random routing networks - for an overview of current 
applications we refer to Vardi (1996). The methodologies developed here should be also 
applicable for these kinds of networks. Results and implementation will be given in further 
technical reports. 
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